Genre Document Classification Using Flexible Length Phrases
Abstract
In this paper we investigate the possibility of using phrases of flexible length in genre classification of textual documents, as an extension of the classic bag-of-words document representation, in which documents are represented using single words as features. The investigation is conducted on a collection of articles from a document database collected from three different sources representing different genres: newspaper reports, abstracts of scientific articles, and legal documents. It compares the classification results obtained with the classic bag-of-words representation against those obtained with a bag of words extended by flexible-length phrases.
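The abstract does not say how the flexible-length phrases are selected, so the following is only a minimal sketch of the general idea under stated assumptions: a phrase is taken to be any contiguous run of 2 to `max_len` words, and a phrase is kept as a feature only if it appears in at least `min_docs` documents. The function names, both parameters, and the selection rule are hypothetical and not taken from the paper.

```python
from collections import Counter

PUNCT = ".,;:!?()\"'"

def tokenize(text):
    # Lowercase, split on whitespace, and strip surrounding punctuation.
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]

def phrase_counts(tokens, max_len):
    # Count every contiguous word sequence of length 2..max_len
    # (assumed stand-in for "flexible length phrases").
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

def build_features(docs, max_len=3, min_docs=2):
    # Returns one Counter per document: single-word counts (classic bag of
    # words) plus counts of phrases found in at least min_docs documents.
    tokenized = [tokenize(d) for d in docs]
    per_doc_phrases = [phrase_counts(t, max_len) for t in tokenized]

    # Document frequency of each candidate phrase.
    doc_freq = Counter()
    for phrases in per_doc_phrases:
        doc_freq.update(phrases.keys())
    kept = {p for p, c in doc_freq.items() if c >= min_docs}

    features = []
    for tokens, phrases in zip(tokenized, per_doc_phrases):
        bag = Counter(tokens)
        for phrase, count in phrases.items():
            if phrase in kept:
                bag[phrase] = count
        features.append(bag)
    return features

docs = [
    "The court ruled that the contract is void.",
    "The court ruled in favour of the plaintiff.",
    "We propose a new method.",
]
feats = build_features(docs)
```

Each resulting `Counter` can be passed to any classifier that accepts sparse count features. Setting `max_len=1` leaves only the single-word features, which gives the baseline for the kind of comparison the abstract describes.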
Similar resources
Combining classifiers for flexible genre categorization of web pages
As the number of web pages grows, it becomes very difficult to find the wanted information easily and quickly among the thousands of web pages retrieved by a search engine. To solve this problem, many researchers propose classifying documents according to their genre, a criterion distinct from topic. Most of these works assign a document to only one genre...
Content-free Document Genre Classification using First Order Random Graphs
We approach the general problem of machine-printed document genre classification using content-free layout structure analysis. Document genre is determined from the layout structure detected from scanned binary images of the document pages, using no OCR results and minimal a priori knowledge of document logical structures. Our approach uses attributed relational graphs (ARGs) to represent the lay...
Genres of Digital Documents: Introduction to the Special Issue
Purpose – To introduce the special issue on “Genres of digital documents.” While there are many definitions of genre, most include consideration of the intended communicative purpose, form and sometimes expected content of a document. Most also include the notion of social acceptance, that a document is of a particular genre to the extent that it is recognized as such within a given discourse c...
Genres of Digital Documents Introduction to the Special
Purpose: This article introduces the Special Issue on Genres of Digital Documents. While there are many definitions of genre, most include consideration of the intended communicative purpose, form and sometimes expected content of a document. Most also include the notion of social acceptance, that a document is of a particular genre to the extent that it is recognized as such within a given dis...
Fine-Grained Document Genre Classification Using First Order Random Graphs
We approach the general problem of classifying machine-printed documents into genres. Layout is a critical factor in recognizing fine-grained genres, as document content features are similar. Document genre is determined from the layout structure detected from scanned binary images of the document pages, using no OCR results and minimal a priori knowledge of document logical structures. Our met...